Home Credit Default Risk (HCDR)

The course project is based on the Home Credit Default Risk (HCDR) Kaggle competition. The goal of this project is to predict whether a client will repay a loan. To make sure that people who struggle to get loans due to insufficient or non-existent credit histories have a positive loan experience, Home Credit makes use of a variety of alternative data, including telco and transactional information, to predict its clients' repayment abilities.

Team Information

(Team photo)

Some of the challenges

  1. Dataset size
    • 688 MB compressed, with millions of rows of data
    • 2.71 GB uncompressed

Kaggle API setup

Kaggle is a data science competition platform that hosts many datasets. In the past, submitting your results was troublesome, as you had to go through the console in your browser and drag your files there. Now you can interact with Kaggle via the command line. E.g.,

! kaggle competitions files home-credit-default-risk

It is quite easy to set up; it takes less than 15 minutes to finish a submission.

  1. Install the library

For more detailed information on setting up the Kaggle API, see here and here.

Dataset and how to download

Background: Home Credit Group

Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.

Home Credit Group

Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. To make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data, including telco and transactional information, to predict its clients' repayment abilities.

While Home Credit is currently using various statistical and machine learning methods to make these predictions, they're challenging Kagglers to help them unlock the full potential of their data. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.

Background on the dataset

Home Credit is a non-banking financial institution, founded in 1997 in the Czech Republic.

The company operates in 14 countries (including the United States, Russia, Kazakhstan, Belarus, China, and India) and focuses on lending primarily to people with little or no credit history, who would otherwise either not obtain loans or become victims of untrustworthy lenders.

Home Credit Group has over 29 million customers, total assets of 21 billion euros, and over 160 million loans, with the majority in Asia and almost half of them in China (as of 2018-05-19).

Data files overview

There are 7 different sources of data:

Downloading the files via Kaggle API

Create a base directory:

DATA_DIR = "../../../Data/home-credit-default-risk"   #same level as course repo in the data directory

Please download the project data files and data dictionary and unzip them using either of the following approaches:

  1. Click the Download button on the competition's Data webpage and unzip the zip file into DATA_DIR.
  2. If you plan to use the Kaggle API, use the following steps.
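Either way, the archive has to be extracted into the data directory. A minimal sketch of the extraction step using Python's standard library (the archive name home-credit-default-risk.zip is an assumption based on the competition slug):

```python
import os
import zipfile

DATA_DIR = "../../../Data/home-credit-default-risk"

def unzip_to(archive_path, dest_dir):
    """Extract every member of `archive_path` into `dest_dir`."""
    os.makedirs(dest_dir, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(dest_dir)

# Example with the downloaded competition archive (name is assumed):
# unzip_to("home-credit-default-risk.zip", DATA_DIR)
```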

Data Import

Data files overview

Data Dictionary

As part of the data download comes a data dictionary. It is named HomeCredit_columns_description.csv.

Application Train

Application test

The application dataset has the most information about the client: gender, income, family status, education, and so on.

The Other datasets

Exploratory Data Analysis

Summary of Application train

Observation 1

Days Employed
Own Car Age

Observation 2

Contract Type with Amount Credit and Code Gender

Summary of bureau

Summary of bureau_balance

Observation 4

Summary of credit_card_balance

Summary of installments_payments

Summary of POS_CASH_balance

Missing data for application train

Distribution of the target column

Correlation with the target column

Pair based visualization

Applicants Age

Applicants occupations

Dataset questions

Unique record for each SK_ID_CURR

previous applications for the submission file

The persons in the Kaggle submission file have had previous applications in previous_application.csv: 47,800 out of 48,744 people have had previous applications.

Histogram of Number of previous applications for an ID

Can we differentiate applicants by low, medium, and high numbers of previous applications?
* Low = fewer than 5 previous applications (22%)
* Medium = 10 to 39 previous applications (58%)
* High = 40 or more previous applications (20%)

Feature Engineering

Feature engineering is the process of selecting, manipulating, and transforming raw data into features that can be used for classification. It produces new features for both supervised and unsupervised learning, with the goal of simplifying and speeding up data transformations while also enhancing model accuracy. A poorly chosen feature can directly hurt your model, so feature engineering is a key step in any machine learning project.

For HCDR as well, feature engineering turns out to be the game changer. There are various features from various datasets that may or may not impact the target variable. Therefore, it is important to create feature families and experiment with different model settings to obtain an accurate classifier.

Feature engineering includes -

Joining secondary tables with the primary table

In the case of the HCDR competition (and many other machine learning problems that involve multiple tables, in 3NF or not), we need to join (denormalize) these datasets when using a machine learning pipeline. Joining the secondary tables with the primary table will yield many new features about each loan application; these features will tend to be aggregate-type features or metadata about the loan or its application. How can we do this when using machine learning pipelines?

Joining previous_application with application_x

We refer to the application_train data (and likewise application_test) as the primary table and the other files as secondary tables (e.g., the previous_application dataset). The secondary tables join back to the application tables via SK_ID_CURR, while the tables describing previous loans (POS_CASH_balance, installments_payments, credit_card_balance) join previous_application via SK_ID_PREV.

Let's assume we wish to generate a feature based on previous application attempts. Possible features could be:

To build such features, we need to join the application_train data (and application_test) with the previous_application dataset (and the other available datasets).

When joining this data in the context of pipelines, different strategies come to mind with various tradeoffs:

  1. Preprocess each of the non-application datasets, thereby generating many new (derived) features, and then join (aka merge) the results with the application_train data (the labeled dataset) and with the application_test data (the unlabeled submission dataset) prior to processing the data (in a train, valid, test partition) via your machine learning pipeline. [This approach is recommended for this HCDR competition. WHY?]

I want you to think about this section and build on this.

Roadmap for secondary table processing

  1. Transform all the secondary tables into features that can be joined into the main application table (labeled and unlabeled):
    • 'bureau', 'bureau_balance', 'credit_card_balance', 'installments_payments',
    • 'previous_application', 'POS_CASH_balance'
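The aggregate-then-merge pattern described above can be sketched with pandas on toy data. The column names follow the real HCDR schema, but the values and the chosen aggregations are illustrative assumptions:

```python
import pandas as pd

# Toy stand-ins for application_train and previous_application
# (column names follow the HCDR schema; values are made up).
app = pd.DataFrame({"SK_ID_CURR": [100001, 100002],
                    "AMT_INCOME_TOTAL": [200000.0, 150000.0]})
prev = pd.DataFrame({"SK_ID_PREV": [1, 2, 3],
                     "SK_ID_CURR": [100001, 100001, 100002],
                     "AMT_CREDIT": [50000.0, 80000.0, 30000.0]})

# Aggregate the secondary table down to one row per SK_ID_CURR...
prev_agg = (prev.groupby("SK_ID_CURR")["AMT_CREDIT"]
                .agg(["count", "mean", "max"])
                .add_prefix("PREV_CREDIT_")
                .reset_index())

# ...then left-join it onto the primary table so no applications are lost.
app = app.merge(prev_agg, on="SK_ID_CURR", how="left")
print(app["PREV_CREDIT_count"].tolist())  # [2, 1]
```

The same recipe applies to bureau, POS_CASH_balance, and the other secondary tables, each producing its own family of aggregate features.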

Feature set - 1

Feature set 1 configuration

Missing values in feature set - 1

feature engineering for prevApp table

feature transformer for prevApp table

Join the labeled dataset

Subset of features considered from each dataset

Join the unlabeled dataset (i.e., the submission file)

Finding correlations of features from all other datasets

Correlation of previous application and target

Correlation of bureau and target

Correlation of bureau balance and target

Correlation of POS Cash Balance and target

Correlation of installments payments and target

Correlation of Credit card balance and target

Feature Aggregation

Engineering new features to find out percentages

Different sets of features can be combined to create new features that might help classification. After data analysis, we found that the following three features can be engineered:

Income Credit percentage - Total income / Credit amount

Average family member income - Total family income / count of family members

Annuity income percentage - Annuity / Total income
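The three ratios above can be computed directly from the application columns. A small sketch with made-up values (the column names AMT_INCOME_TOTAL, AMT_CREDIT, AMT_ANNUITY, and CNT_FAM_MEMBERS are from the HCDR schema; the derived feature names are our own choices):

```python
import pandas as pd

# Toy application rows; values are made up.
df = pd.DataFrame({"AMT_INCOME_TOTAL": [200000.0, 90000.0],
                   "AMT_CREDIT": [400000.0, 180000.0],
                   "AMT_ANNUITY": [20000.0, 9000.0],
                   "CNT_FAM_MEMBERS": [2, 3]})

df["INCOME_CREDIT_PERC"] = df["AMT_INCOME_TOTAL"] / df["AMT_CREDIT"]
df["INCOME_PER_PERSON"] = df["AMT_INCOME_TOTAL"] / df["CNT_FAM_MEMBERS"]
df["ANNUITY_INCOME_PERC"] = df["AMT_ANNUITY"] / df["AMT_INCOME_TOTAL"]
print(df["INCOME_CREDIT_PERC"].tolist())  # [0.5, 0.5]
```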

Occupation feature engineering

To perform Feature aggregation at various levels

Second level of data

Feature set - 2

Feature set - 2 configuration

Third level of data

New payment features

payment_diff_curr_pay = total current payment - current payment

payment_diff_min_pay = total current payment - installment minimum regularity

To identify numerical features

To identify categorical features

Aggregated features

To merge third level of data of Previous Apps

To merge third level of data of Bureau

Features engineered using aggregate functions

New features from aggregated features - to include Average and Range

Feature set - 3

Feature set 3 configuration

Merging the data

New features - to include Percentages

Days employed percentage = number of days employed / number of days lived

Credit income percentage = credit amount / total income

Annuity income percentage = Annuity amount / total income

Adding Polynomial features

Adding polynomial features for EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3, DAYS_BIRTH
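One way to generate these polynomial terms is scikit-learn's PolynomialFeatures; a sketch on toy values for the four columns (degree 2 is an assumption about the expansion used):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy values for EXT_SOURCE_1, EXT_SOURCE_2, EXT_SOURCE_3, DAYS_BIRTH.
X = np.array([[0.5, 0.6, 0.7, -12000],
              [0.2, 0.9, 0.4, -15000]])

poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
# 4 original columns + 4 squares + 6 pairwise products = 14 features
print(X_poly.shape)  # (2, 14)
```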

Appending polynomial features

List of numerical features

List of categorical features

Total number of features

Correlations with all features included

Final feature set

Final feature set configuration

Final list of columns - column type

Number of numerical and categorical features in final list

Processing pipeline

Feature Selection

We selected the top 10 correlated features for building the baseline pipeline.

HCDR preprocessing

Experiment log

Baseline Model

To get a baseline, we use some of the features after preprocessing them through the pipeline. The baseline model is a logistic regression model.

Evaluation metrics

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.

The scikit-learn roc_auc_score function computes the area under the receiver operating characteristic (ROC) curve, also denoted AUC or AUROC. Computing the area under the ROC curve summarizes the curve's information in one number.

>>> import numpy as np
>>> from sklearn.metrics import roc_auc_score
>>> y_true = np.array([0, 0, 1, 1])
>>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> roc_auc_score(y_true, y_scores)
0.75

Loss Functions

Binary classification loss functions that can be used to calculate the loss/error and to update/rearrange the feature weights accordingly are -

Cross-entropy loss - This is the default loss function for most binary classification problems. It is related to maximum likelihood, as it calculates a score that summarizes the average difference between the actual and predicted probability distributions for the predicted class.

Hinge loss - This loss function is used for SVM models. It checks that examples have the correct sign, assigning more error when the actual and predicted class values differ in sign.

Squared-hinge loss - A popular extension of hinge loss, it simply squares the hinge loss score.

We used the cross-entropy loss function, which is the most suitable loss function for this binary classification problem.
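Cross-entropy for binary targets is available in scikit-learn as log_loss; a sketch reusing the toy labels and probabilities from the ROC example, with the formula checked by hand:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])  # predicted P(TARGET=1)

# Cross-entropy: -mean(y*log(p) + (1-y)*log(1-p))
loss = log_loss(y_true, y_prob)
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(abs(loss - manual) < 1e-9)  # True
```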

Some metrics are sensitive to class imbalance, and imbalance should not be allowed to distort the assessment of the model. Therefore, the helpful metrics here are:

Train, Test Datasets

Pipeline Definition

Resampling minority class
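One common way to resample the minority class is to oversample it with replacement up to the majority size; a sketch with scikit-learn's resample utility on a toy frame (the 92/8 split mirrors the rough TARGET imbalance, the exact numbers are illustrative):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame (HCDR's TARGET is roughly 92% zeros, 8% ones).
df = pd.DataFrame({"x": range(100),
                   "TARGET": [0] * 92 + [1] * 8})

majority = df[df.TARGET == 0]
minority = df[df.TARGET == 1]

# Oversample the minority class (with replacement) to the majority size.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced.TARGET.value_counts().tolist())  # [92, 92]
```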

Baseline metrics

Baseline Experiments

Hyperparameter Tuning

Logistic Regression Model

Best parameters: predictor__C: 0.1, predictor__penalty: l2, predictor__tol: 1e-05
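The predictor__ prefix in these parameter names follows scikit-learn's Pipeline convention: step name, double underscore, parameter name. A minimal sketch of how such a search is wired up, using synthetic data rather than the HCDR features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("predictor", LogisticRegression(max_iter=1000))])

# Step name + "__" + parameter name gives keys like predictor__C.
grid = GridSearchCV(pipe, {"predictor__C": [0.1, 1.0],
                           "predictor__penalty": ["l2"]}, cv=3)
grid.fit(X, y)
print(sorted(grid.best_params_))  # ['predictor__C', 'predictor__penalty']
```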

Gradient Boosting

Best parameters: predictor__max_depth: 10, predictor__max_features: 10, predictor__min_samples_leaf: 5, predictor__n_estimators: 1000, predictor__n_iter_no_change: 10, predictor__subsample: 0.8, predictor__tol: 0.0001, predictor__validation_fraction: 0.2

XGBoost

Best parameters: predictor__max_depth: 10, predictor__max_features: 10, predictor__min_samples_leaf: 5, predictor__n_estimators: 1000, predictor__n_iter_no_change: 10, predictor__subsample: 0.8, predictor__tol: 0.0001, predictor__validation_fraction: 0.2

Additional Experiments

Support Vector

Best parameters: predictor__max_depth: 10, predictor__max_features: 10, predictor__min_samples_leaf: 5, predictor__n_estimators: 1000, predictor__n_iter_no_change: 10, predictor__subsample: 0.8, predictor__tol: 0.0001, predictor__validation_fraction: 0.2

Logistic regression with PCA

Best parameters: predictor__C: 0.1, predictor__penalty: l2, predictor__tol: 0.0001

Model Validation

Families of Input Features Used

Results and Analysis

Box-Plot

AUC (Area Under the ROC Curve)

Precision Recall Curve

Confusion Matrix

Final Results

Final Model Tuned

Submission File Prep

For each SK_ID_CURR in the test set, you must predict a probability for the TARGET variable. The file should contain a header and have the following format:

SK_ID_CURR,TARGET
100001,0.1
100005,0.9
100013,0.2
etc.
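Building this file from a fitted model is a two-column DataFrame write; a sketch using the sample IDs above and made-up probabilities (in the real pipeline the probabilities would come from model.predict_proba on the test features):

```python
import pandas as pd

# Hypothetical IDs and predicted probabilities; in the real pipeline these
# come from application_test and the fitted model's predict_proba(...)[:, 1].
submission = pd.DataFrame({"SK_ID_CURR": [100001, 100005, 100013],
                           "TARGET": [0.1, 0.9, 0.2]})
submission.to_csv("submission.csv", index=False)
print(open("submission.csv").readline().strip())  # SK_ID_CURR,TARGET
```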

Kaggle submission via the command line API

Phase 3 - Multi Layer Perceptron (Neural Network)

Data Preparation

Neural Network Model with no hidden layer

MLP Model with 1 hidden layer

MLP results

Kaggle Submission

report submission

Click on this link

Write-up

For this phase of the project, you will need to submit a write-up summarizing the work you did. The write-up form is available on Canvas (Modules-> Module 12.1 - Course Project - Home Credit Default Risk (HCDR)-> FP Phase 2 (HCDR) : write-up form ). It has the following sections:

Abstract

The main goal of this project is to use a machine learning model for historical loan application data to predict if the customer will be able to repay a loan.

Phase 2:

The main aim of this phase is data modeling and feature engineering. Extending the visual-EDA-driven feature sampling and baseline model development, data modeling is used here to combine all the available datasets, and feature engineering is done considering polynomial, aggregated, numerical, and categorical features. We performed experimental analysis for hyperparameter tuning of Logistic Regression and XGBoost, and conducted experiments using both the original imbalanced data and resampled data.

What you did (main experiments): The datasets consist of three levels of data. We first combined different levels of data and computed correlations from those combinations. Multiple feature families were then created, with categorical, numerical, aggregated, and polynomial feature implementations applied to each. These features were fed into the pipeline and the best feature family was chosen. Feature engineering is very important in this phase: the data is large, many feature families can be created, and choosing the most impactful family is a huge task, so domain knowledge matters. After choosing the best feature family, the machine learning models (baseline model, XGBoost, and the PL model) were run through the pipeline and hyperparameter tuning was performed on them.

What were your results/findings (best pipeline and the corresponding public, private scores): Our results for this phase show that the best performing algorithm was XGBoost, which achieved the best ROC AUC score of 71.85%. The lowest performing algorithm was the SVM model. The best Kaggle score out of all four submissions was 0.72720 (private) and 0.73006 (public).

Problems you are tackling: The main problem faced while working on this project is the data: it is a very large dataset. Secondly, a lot of feature engineering yielded only a small increase in test accuracy.

Phase 3:

We explored deep learning, which learns from past data using artificial neural networks with multiple hidden layers (two or more). Deep neural networks unfold a complex representation of the data step by step, layer by layer (hence multiple hidden layers), into a clear representation. An artificial neural network with one hidden layer between the input and output layers is called a multilayer perceptron (MLP). We added a single-layer neural network and a multilayer neural network model, and resampled the data to balance the data points from both classes. The deep learning Kaggle score fell short of the ensemble model's, which shows that a neural network is not always a good choice for supervised binary classification: simple methods like logistic regression and gradient methods like XGBoost outperformed the neural network model. We used XGBoost to predict loan default and reached an AUC score of over 0.72 on our Kaggle submission.

Building on our exploratory data analysis, additional features, and boosted models, in Phase 3 our team implemented multiple multilayer perceptron models, experimenting with different architectures and activation types. Our top-performing MLP model was built using PyTorch, and its test ROC AUC score of 0.767 was the highest among all the models.

Project Description

Description of Data:

The complete dataset consists of 7 CSV files, i.e., 7 tables. The application train/test table is the primary table, and the remaining 6 tables are secondary/supporting tables.

Primary tables: The application train and application test tables are the main tables, containing information about each loan application at Home Credit. The primary key of these tables is SK_ID_CURR, which uniquely identifies each loan entry.

Training application (application_train): The training data includes the label TARGET, which has two values: 0 or 1. 0 indicates that the loan was repaid without any problems or delay; 1 indicates that the loan was not repaid, that there was some difficulty paying back the loan amount, or that installments were paid back with some delay. This table has 122 variables and 307,511 rows.

Testing application (application_test): The testing application has the same features as the training application except for TARGET. This table has 121 variables and 48,744 rows.

Secondary Tables: The following are the 6 secondary tables:

  1. Bureau (bureau.csv): The bureau table contains the client's previous credits from other financial institutions, received before the application for the current loan. Each previous credit has its own row, and each loan in the application data can have multiple previous credits. The application_{train/test} table joins the bureau table on SK_ID_CURR. This table has 17 variables and 1,716,428 rows.

  2. Bureau Balance (bureau_balance.csv): The bureau balance table contains the monthly balances of the client's previous credits from other financial institutions. It joins the bureau table on SK_ID_BUREAU, which is unique in the bureau table but a foreign key in bureau_balance, creating a one-to-many relation. This table has 3 variables and 27,299,925 rows.

  3. Previous Application (previous_application.csv): This table contains customers' previous applications at Home Credit. It joins the primary table on SK_ID_CURR, with one row per previous application (a many-to-one relation with the application table). The number of data entries is 1,670,214. There are four types of contracts:

    - Consumer loan(POS – Credit limit given to buy consumer goods)
    - Cash loan(Client is given cash)
    - Revolving loan(Credit)
    - XNA (Contract type without values)
  4. POS Cash Balance (POS_CASH_balance.csv): This table contains monthly balance snapshots of previous point-of-sale or cash loans the customer has had with Home Credit. It joins the previous_application table on SK_ID_PREV, with one row per monthly balance, giving a many-to-one relationship with previous_application. This table has 8 variables and 10,001,358 rows.

  5. Installments Payments (installments_payments.csv): This table contains the past payment data for each installment of previous credits at Home Credit related to loans in our sample. It joins the previous_application table on SK_ID_PREV, with one row per payment, giving a many-to-one relationship with previous_application. This table has 23 variables and 3,840,312 rows.

  6. Credit Card Balance (credit_card_balance.csv): This table contains monthly balances of the client's previous credit-card loans at Home Credit, with one row for every payment made and one for every missed payment. It joins the previous_application table on SK_ID_PREV, giving a many-to-one relationship with previous_application. This table has 8 variables and 13,605,401 rows.

Tasks to be tackled:

Many people struggle to get loans due to insufficient or non-existent credit histories, and, unfortunately, this population is often taken advantage of by untrustworthy lenders. Home Credit (an international non-bank financial institution) strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. To make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data, including telco and transactional information, to predict its clients' repayment abilities. Home Credit primarily focuses on lending money regardless of credit history, and the Kaggle dataset's objective is to identify and reduce unfair loan rejections by considering more than just credit history.

The main aim of this project is to predict applicant/customer behavior on loan repayment using a machine learning model. First we create a balanced dataset by handling missing values and doing correlational analysis on the given data. Then we create a final feature set, including numerical and categorical feature pipelines, based on correlation scores. The data pipeline and a baseline logistic regression (LR) model are trained and evaluated; the best LR model is then chosen, and based on it and various performance metrics, the best prediction is made. The results of the machine learning pipelines are measured using the confusion matrix, precision, recall, F1 score, accuracy score, and area under the ROC curve. Businesses will be able to use the model's output to identify whether a loan is at risk of default. The new model ensures that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower clients to be successful.

Workflow:

This is the workflow we used for our project.

(Workflow diagram)

Neural Network

After focusing on exploratory data analysis, feature selection and preliminary modelling, in this last phase of the HCDR project, our work focused on three parts
Single Layer Neural network
Multi Layer Neural Network
MLP-model building and applying classification Technique

Single Layer Neural network
Here we transform the data using the data pipeline and convert it into tensors for the neural network pipeline. A linear layer produces the prediction probability.
Multi Layer Neural Network
The model contains 2 linear layers and one hidden layer with a ReLU activation.

MLP-model building and applying classification techniques
In phase 3 of the project, our goal was to build a multilayer perceptron (MLP) classification model in PyTorch and use TensorBoard to monitor real-time training results. We applied nn.Linear() layers and the nn.ReLU() activation function in the MLP model, with 43 initial transformed features and 2 final output features. To visualize the real-time training results, TensorBoard was introduced to monitor the training loss (CrossEntropyLoss) and accuracy of each epoch. The test accuracy for the MLP both with and without the hidden layer is 0.922. The ROC AUC score for the MLP model increased to 0.767.
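The models above were built in PyTorch; as a framework-free illustration, the forward pass of a one-hidden-layer MLP (linear, ReLU, linear, softmax) can be sketched in NumPy. The 43-in/2-out dimensions follow the text; the hidden size of 16 and the random weights are purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)

n_in, n_hidden, n_out = 43, 16, 2  # 43 transformed features -> 2 classes;
                                   # the hidden size 16 is an assumed choice
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_out)); b2 = np.zeros(n_out)

def mlp_forward(X):
    """Linear -> ReLU -> Linear -> softmax, mirroring nn.Linear + nn.ReLU."""
    return softmax(relu(X @ W1 + b1) @ W2 + b2)

probs = mlp_forward(rng.normal(size=(5, n_in)))
print(probs.shape)  # (5, 2)
```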

Data Leakage:

Data leakage is one of the leading machine learning errors. It happens when the data used to train an algorithm contains the information the model is trying to predict, resulting in unreliable and bad predictions after model deployment. In phases 2 and 3 we handled the missing values in the data by replacing some of them with mean and median values. We also split the data into train, validation, and test sets, fitting transformations on the training data only. The data was standardized using StandardScaler, and we resampled it to have an equal number of points from both classes. With all these factors taken into consideration, there is no considerable data leakage in our modeled pipelines.
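The fit-on-train-only discipline is what scikit-learn's fit_transform/transform split encodes; a toy sketch (the data and split sizes are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)  # toy single-feature data
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # statistics learned on train only
X_test_s = scaler.transform(X_test)        # reused on test: no leakage
print(abs(X_train_s.mean()) < 1e-9)  # True: train is centered at zero
```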

Modelling Pipeline

We have deployed pipelines to prevent data leaking during numeric and categorical feature preparation.

Phase-1: A logistic regression model is used as the baseline model since it is simple to develop and efficient; it does not require a lot of processing resources to train.

Phase-2: We look into different classification models to see if we can improve our predictions. Our main focus is on boosting algorithms, which are said to be extremely efficient and relatively fast. Gradient Boosting, XGBoost, LightGBM, and SVM were the preferred techniques, for the following reasons. By generating an ensemble of weak predictors, Gradient Boosting creates a better predictive model. XGBoost is one of the quickest gradient-boosted tree implementations and handles missing values internally. In many circumstances, LightGBM produces results that are more effective and faster than XGBoost while using less memory. When linear separation is required, SVM performs similarly to logistic regression, and depending on the kernel used, it also performs well with non-linear boundaries; SVM is, however, prone to overfitting/training difficulties depending on the kernel. A voting classifier is a machine learning model that learns from an ensemble of different models and predicts an output based on the highest probability of the result being the target class.

Phase-3: We explored deep learning, which learns from past data using artificial neural networks with multiple hidden layers (two or more). Deep neural networks unfold a complex representation of the data step by step, layer by layer (hence multiple hidden layers), into a clear representation. An artificial neural network with one hidden layer between the input and output layers is called a multilayer perceptron (MLP). We added a single-layer neural network and a multilayer neural network model, and resampled the data to balance the data points from both classes. The deep learning Kaggle score fell short of the ensemble model's, which shows that a neural network is not always a good choice for supervised binary classification: simple methods like logistic regression and gradient methods like XGBoost outperformed the neural network model.

Results and Discussion

All the details of the results and screenshots are below. Overall in this project, we used various feature selection techniques on a model with 183 highly correlated features. XGBoost and logistic regression have almost the same public and private scores.

Neural Network: Our simple neural network's ROC score was 76.21% and the multilayer neural network's ROC score was 72.60%, so we can conclude that the simple network performed better than the multilayer one. Training the deep learning model on the full dataset took much less time compared to other classifiers such as logistic regression and XGBoost.

We have used many classifiers as follows:

  1. Logistic Regression: This model was chosen as the baseline, trained with both the balanced and imbalanced datasets with feature engineering. The training accuracy for this model was 68.9% and the test accuracy 68%. The best parameters yielded a 75% ROC score. The same model run with PCA reduced the test ROC to 69%.

  2. Gradient Boosting: Boosting did help achieve better results, good enough to continue implementing and evaluating other boosting models. This model achieved a training accuracy of 99.6% and a test accuracy of 91.3%. The test area under the ROC curve came out to 66%.

  3. XGBoost: This was by far the best model, in both timing and accuracy, for the selected features and the balanced dataset. The training and test accuracies are 95.5% and 83.4%, and the test area under the ROC curve is 63.6%.

  4. SVM: This was the lowest performing model in our experiments. Even after hyperparameter tuning of the RBF and polynomial kernels, the results were not promising. The ROC score achieved for this model is 51.5%.

WhatsApp%20Image%202022-04-30%20at%2011.07.07%20PM.jpeg

WhatsApp%20Image%202022-04-30%20at%2011.06.52%20PM.jpeg

Overall results of classifiers:

WhatsApp%20Image%202022-05-01%20at%2010.19.49%20AM.jpeg

Results of Neural Network Model

Screen%20Shot%202022-05-01%20at%2010.59.22%20AM.png

Conclusion

In the HCDR project we use Home Credit's data to predict which clients with little or no credit history can repay their loans. This, in turn, improves livelihoods by providing loans to people with low credit history. The main objective of our machine learning model for loan-repayment risk detection is to identify potential defaulters based on the given data about the applicants. The classification probability is essential because we want to be very sure when we classify someone as a non-defaulter, as the cost of such a mistake can be very high to the company. In this phase we have supported our hypothesis that tuned machine learning techniques can outperform baseline models to aid Home Credit in their evaluation of loan applications.

Our hypothesis was that ML pipelines with custom features can accurately predict HCDR outcomes, and feature engineering turned out to be the most crucial part of building an accurate classifier. Before the feature engineering of Phase 2, the accuracy score was around 63%; after feature aggregation and incorporating the secondary and tertiary datasets, it increased to 73%.
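The feature-aggregation step can be sketched with pandas: roll a secondary table (one row per prior loan) up to one row per applicant, then join it onto the main application table. `SK_ID_CURR` is the applicant key in the HCDR data; the other column names and values here are illustrative.

```python
# Sketch: aggregate a secondary table to applicant level and merge.
import pandas as pd

# Main application table: one row per applicant (toy values)
apps = pd.DataFrame({"SK_ID_CURR": [1, 2, 3],
                     "AMT_INCOME": [120000, 90000, 150000]})

# Secondary table: one row per prior credit (toy values)
prior = pd.DataFrame({"SK_ID_CURR": [1, 1, 2],
                      "AMT_CREDIT": [50000, 20000, 70000]})

# Aggregate prior credits per applicant: mean, max, and count
agg = (prior.groupby("SK_ID_CURR")["AMT_CREDIT"]
            .agg(["mean", "max", "count"])
            .add_prefix("PRIOR_CREDIT_")
            .reset_index())

# Left-join so applicants with no prior credits are kept (filled with 0)
features = apps.merge(agg, on="SK_ID_CURR", how="left").fillna(0)
print(features)
```

Each aggregation statistic becomes a new candidate feature on the main table, which is how the secondary and tertiary datasets feed into the model.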

Phase 1: we analyzed the features of the dataset, compared each feature's relation with the target, identified the top 14 correlated features, and examined the missing values and the distribution of each feature. We chose a subset of highly correlated features based on this correlational analysis. The Kaggle submission for this phase scored 73%, which we believe is a very good start.

Phase 2: a simple baseline model, data modeling with feature aggregation, feature engineering, and various data preprocessing pipelines both increased and reduced the efficiency of the models. The models used for prediction were Logistic Regression with PCA to handle multicollinearity, and ensemble approaches using Gradient Boosting, XGBoost, and SVM. Our best performing algorithm was XGBoost, with a best AUC ROC score of 71.85%; the lowest performing model was SVM. Among the ensemble models, Gradient Boosting showed lower results, with a validation AUC ROC of 71.52%. Our best Kaggle score out of all four submissions was 0.72720 (private) and 0.73006 (public).

Phase 3: we implemented neural networks with single-layer and multi-layer perceptrons. We used XGBoost to predict loan default and were able to reach an AUC score of over 0.72 on our Kaggle submission. Building on our exploratory data analysis, additional features, and boosted models, our team implemented multiple Multi-Layer Perceptron (MLP) models, experimenting with different architectures and activation types. Our top performing MLP model was built using Scikit-learn's MLPClassifier, yielding a validation AUC of 0.526, lower than our best performing "soft voting classifier" model.

Our simple neural network ROC score was 76.21% and the multilayer neural network ROC score was 72.60%, so the simple network performed better than the multilayer one. Our best performing models were Logistic Regression and XGBoost, with almost identical train and test Kaggle accuracy. The overall highest private score is 72.28% (XGBoost) and the highest public score is 72.61% (Logistic Regression).

Although we are done with the case study, there are still a couple of things we had in mind but could not try due to time and resource constraints. One thing we attempted but could not complete was Sequential Forward Feature Selection for choosing the best set of features: given the number of features, it has very high time complexity, and without stronger computational resources we could not run it to completion. We also believe we have not utilized stacking to its full potential in this case study. An even better score could be achieved by stacking diverse base classifiers trained on different sets of features, probably around 15-20 base classifiers, which could give very strong results.
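The stacking idea above can be sketched with scikit-learn's `StackingClassifier`: a handful of diverse base classifiers (not the 15-20 envisioned) combined by a logistic-regression meta-learner, on synthetic stand-in data.

```python
# Sketch: stack diverse base classifiers; a meta-learner combines their
# out-of-fold predicted probabilities.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    stack_method="predict_proba",  # feed probabilities to the meta-learner
)
stack.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
print(f"Stacked ROC AUC = {auc:.3f}")
```

In a fuller version, each base classifier could be trained on a different feature subset (e.g. via per-estimator `ColumnTransformer` pipelines), which is the diversity the paragraph above has in mind.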

References

Some of the material in this notebook has been adapted from here

TODO: Predicting Loan Repayment with Automated Feature Engineering in Featuretools

Read the following: